🔍 Analysis

survey0 <- read_rds(here::here("raw_data/survey_data.rds"))

Identify and remove the direct identifiers from the data.

Records are not being shuffled so that the data custodian can verify the atomicity/correctness of the records in the published data.

survey1 <- survey0 %>%
  select(-c(IPAddress,ResponseID,ResponseFirstName, ResponseLastName, QID28)) 

#shuffling rows so that it does not match original raw_data order.
set.seed(321)
rndm_order <- sample(nrow(survey1))
#survey1 <- survey1[rndm_order,] not shuffling so that the data custodian can verify the atomicity/correctness of the records in the published data.

De-identification strategy

#converting to required data types

survey1 <- survey1 %>%
  mutate(
    StartDate = ymd(StartDate),
    EndDate = ymd(EndDate),
    Duration = as.numeric(Duration),
    RecordedDate = ymd(RecordedDate),
    QID6 = as.numeric(QID6),
    QID7 = as.numeric(QID7),
    QID8 = as.numeric(QID8),
    QID19 = as.numeric(QID19),
    QID21 = as.numeric(QID21),
    QID22 = as.numeric(QID22)
  )
#imputing missing demographics for unfilled forms
survey2 <- survey1 %>%
  mutate(QID6=case_when(Progress<10  ~ median(QID6, na.rm = TRUE),
                        TRUE ~ QID6),
         QID7=case_when(Progress<10 ~ median(QID7, na.rm = TRUE),
                        TRUE ~ QID7),
         QID8=case_when(Progress<10 ~ median(QID8, na.rm = TRUE),
                        TRUE ~ QID8),
         QID15=case_when(Progress<10 ~ median(QID15, na.rm = TRUE),
                        TRUE ~ QID15)
         )
survey3 <- survey2 %>%
  mutate(age_group = as.character(cut(QID6,breaks = c(21,40,60,88)))) %>%  #generalization
  mutate(income = QID22 + rnorm(n(),0,5000)) %>% #perturbation
  mutate(income2019 = ifelse(income>200000, 220000, income)) %>% #top coding
  mutate(income2019 = case_when(income2019<30000 ~ 30000,
                                      TRUE ~ income2019)) %>% #bottom coding
   mutate(income2019 = case_when(
     income2019==30000 ~ income2019 + rnorm(n(),0,2000),
     income2019==220000 ~ income2019 + rnorm(n(),0,10000),
       TRUE ~ income2019                               
     )) %>% #perturbation
  mutate(income2020 = case_when(
     QID21<30000 & QID21!=0 ~ 30000 + rnorm(n(),0,2000),
     QID21>200000 ~ 220000 + rnorm(n(),0,10000),
     QID21!=0 ~ QID21 + rnorm(n(),0,500),
     TRUE ~ QID21
     )) %>% #perturbation
  mutate(income2021 = case_when(
     QID19<30000 & QID19!=0 ~ 30000 + rnorm(n(),0,2000),
     QID19>200000 ~ 220000 + rnorm(n(),0,10000),
     QID19!=0 ~ QID19 + rnorm(n(),0,500),
     TRUE ~ QID19
     ))
#generalization
survey4 <- survey3 %>%
  mutate(LocationLongitude = round(LocationLongitude, 0),
    LocationLatitude=round(LocationLatitude,0))
# postcode is not important for the analysis. 
#many ones. best option swapping
survey5 <- survey4 %>%
    group_by(LocationLongitude, LocationLatitude) %>%
mutate(postal_code = sample(QID15,n(),replace = TRUE)) %>%
  ungroup()
# de-identifying unfinished records
set.seed(109)
survey6 <- survey5 %>%
  mutate(StartDate = case_when(Progress!=100 ~ sample(StartDate,n(),replace = TRUE),
                               TRUE ~ StartDate),
         Progress = case_when(Progress!=100 ~ round(Progress + rnorm(n(),0,6.2),1),
                              TRUE ~ Progress))
#removing original variables
survey7 <- survey6 %>%
  select(-c(QID6, QID22, QID21, QID19, QID15, income))

Check strategy

The published data is de-identified with respect to the demographics of the participants owing to the ethical responsibility and to comply with the Australian Privacy Principles (APPs) in the Privacy Act 1988. Ethics is a balance of risk and benefit. In this case the risk of identification is balanced with the utility of the data with respect to mental health, financial stability and home life.

There are complete forms and incomplete forms. However out of 1000 records, there are only 21 incomplete forms. The records of incomplete forms are listed below. Usually start date is same as end date however there are 29 participants with different start and end day and 16 people have unique start and end date that is different. all values were recorded on end date.

#finished=0
finished0 <- survey0  %>% filter(Finished == "0") #date not linked to finish status and 21 unfinished form.
q28na <- survey0  %>% filter(is.na(QID28)) #incomplete form when QID28 is not provided.

# survey0  %>% filter(Finished == "0") %>% right_join(survey0  %>% filter(is.na(QID24_1))) %>% view()
# 
# survey0  %>% filter(Finished == "0") %>% anti_join(survey0  %>% filter(is.na(QID24_1))) %>% view()
Unfinished Forms
id_number StartDate Finished IPAddress Progress Duration EndDate RecordedDate ResponseID ResponseLastName ResponseFirstName LocationLatitude LocationLongitude QID29 QID6 QID7 QID8 QID15 QID22 QID21 QID19 QID12 QID10 QID14 QID16 QID17_1 QID17_2 QID17_3 QID17_4 QID17_5 QID18_1 QID18_2 QID18_3 QID18_4 QID18_5 QID20 QID23 QID24_1 QID24_2 QID24_3 QID24_4 QID24_5 QID24_6 QID25_1 QID25_2 QID25_3 QID25_4 QID25_5 QID25_6 QID26 QID27 QID28
77 2021-01-16 0 43.170.71.226 27.5 NA NA NA R_x5XZtitJAHZhZAII el-Murad Jenna -37.46126 145.5511 1 53 3 0 3185 95065.54 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
163 2021-01-11 0 106.251.13.93 66.0 NA NA NA R_sSPIRE1pw00RihqN Zamarripa Cortes Luke -38.07033 145.5446 1 38 2 3 3135 97347.76 113420.21 125811.61 2 6 4 6 1 0 0 0 0 1 0 0 0 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
204 2021-01-01 0 9.103.53.190 44.0 NA NA NA R_ouYXTSbjOaYrCNKu Tanghal Eric -38.14855 146.5748 1 52 1 1 3141 95782.19 113725.38 104729.83 2 2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
208 2021-01-04 0 17.210.165.98 5.5 NA NA NA R_QOBNlCgAFzkVo2eP Nolan Lucas -37.70710 144.3773 1 40 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
312 2021-01-07 0 45.230.13.12 88.0 NA NA NA R_OZXDhdo9fyickyeQ Bellendir Quiere -37.52825 145.6797 1 41 2 0 3086 146447.00 0.00 0.00 2 0 6 0 0 0 0 0 0 NA NA NA NA NA 1 NA 0 0 1 0 0 0 0 0 0 1 1 0 NA NA NA
365 2021-01-02 0 13.252.239.50 60.5 NA NA NA R_vXV4PLF3swnUgd7Q Haro Logan -37.86608 144.2485 1 45 3 3 3154 78272.93 107102.10 108716.14 2 6 6 1 1 0 0 0 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
369 2021-01-10 0 10.184.236.167 0.0 NA NA NA R_n1ShMFRdZ15r6SlN Serna Macella -37.65061 144.2748 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
381 2021-01-23 0 32.212.15.245 44.0 NA NA NA R_4KWbbDZUcO6jTiV5 Macdonnell Taylor -38.02511 145.6101 1 52 3 1 3140 54781.61 79984.16 78989.82 3 6 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
396 2021-01-28 0 114.138.239.65 22.0 NA NA NA R_xJsYI9ign9xTms5n Romo-Garcia Zachary -37.91028 145.1299 1 46 3 1 3116 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
397 2021-01-30 0 62.153.117.209 88.0 NA NA NA R_HRFSnJIQcs7nfEHj Maisen Jennifer -38.35605 145.5131 1 67 1 2 3130 139224.43 153360.51 160608.00 4 3 6 1 1 0 0 0 0 1 0 1 0 1 1 3 1 0 0 0 0 1 1 0 0 1 0 1 NA NA NA
417 2021-01-20 0 125.76.148.119 77.0 NA NA NA R_cS4gYwe1S4zsH790 Sheader Sheyenne -38.28062 146.6651 1 58 1 0 3070 119540.94 142290.12 137121.28 3 6 5 2 1 0 1 0 0 1 0 0 1 1 2 3 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
447 2021-01-18 0 62.231.28.10 93.5 NA NA NA R_2seOpS2ckDrshHqZ Hayes-Joshua Andrew -38.36066 144.8651 1 68 4 0 3060 172802.08 0.00 186303.44 5 0 5 0 1 0 0 0 0 NA NA NA NA NA 1 NA 1 1 1 0 0 0 1 0 0 1 1 1 3 NA NA
476 2021-01-05 0 88.213.175.208 49.5 NA NA NA R_6F0s8anTmvVdj13W Gulaid Raihaana -37.59215 144.2300 1 63 2 1 3139 82773.11 0.00 107823.81 2 0 4 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
616 2021-01-11 0 116.231.99.134 16.5 NA NA NA R_YKQzAhF9NnXyJF9Q Crews Joseph -37.86295 146.8587 1 47 1 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
634 2021-01-25 0 67.185.88.151 22.0 NA NA NA R_ZGrTGzPgbG62E3L9 Simpson Dominick -37.81201 145.5292 1 57 1 0 3020 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
762 2021-01-13 0 23.75.109.91 88.0 NA NA NA R_Ids52Yg7RlJIFFTF Powell Vanessa -38.34920 144.5482 1 37 2 0 3206 54374.09 65419.80 73155.12 2 6 5 6 1 0 0 0 0 1 0 0 1 1 2 4 1 0 0 0 1 0 0 0 0 1 0 0 NA NA NA
772 2021-01-30 0 120.246.100.6 33.0 NA NA NA R_JuLJKSxavtTOzxeQ Lee Salaah -38.09860 144.7627 1 64 1 1 3191 125814.33 158441.66 139066.22 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
821 2021-01-30 0 45.202.3.2 88.0 NA NA NA R_TBfQF6IcuuQTXTsb Reeser Kelli -37.75127 145.3153 1 55 3 2 3028 58931.06 72406.46 70574.59 6 6 6 6 1 0 0 1 1 0 1 1 0 0 2 1 1 0 1 0 0 1 0 1 0 0 0 0 NA NA NA
963 2021-01-22 0 86.252.211.246 33.0 NA NA NA R_sQVBzRbe1t7XPLAz Whyte Hamdaan -37.74646 143.8936 1 48 1 2 3036 129949.86 167894.59 145883.32 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
985 2021-01-09 0 60.124.158.27 27.5 NA NA NA R_gkAPsLLLurGtfiNa el-Mady Johnathan -37.52656 147.0865 1 53 2 1 3036 104209.62 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
996 2021-01-10 0 126.82.35.85 55.0 NA NA NA R_K93D3dnd5lOqKQAE el-Salek Ailee -37.16393 145.6576 1 33 4 1 3012 128311.90 139351.79 150497.22 2 6 6 3 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#usually start date is same as end date however there are 29 participants with different start and end day
stnotend <- survey1  %>% filter(StartDate != EndDate) 
unique(stnotend$StartDate)
##  [1] "2021-01-03" "2021-01-27" "2021-01-04" "2021-01-16" "2021-01-29"
##  [6] "2021-01-14" "2021-01-26" "2021-01-30" "2021-01-31" "2021-01-12"
## [11] "2021-01-15" "2021-01-28" "2021-01-09" "2021-01-06" "2021-01-23"
## [16] "2021-01-21"
unique(stnotend$EndDate) #16 people have unique start and end date that is different.
##  [1] "2021-01-04" "2021-01-28" "2021-01-05" "2021-01-17" "2021-01-30"
##  [6] "2021-02-01" "2021-01-31" "2021-01-13" "2021-01-16" "2021-01-29"
## [11] "2021-01-10" "2021-01-14" "2021-01-07" "2021-01-15" "2021-01-24"
## [16] "2021-01-22"
recordnotend <- survey0  %>% filter(RecordedDate != EndDate) # all values recorded same as end day.
#All participants have agreed to informed consent
q29 <- unique(survey0$QID29) #answer=1
pct_miss(survey0$QID29)  #0
## [1] 0

In order to de-identify the records with unfinished status and to make the data more usable imputation was done to the demographic variables QID6, QID7, QID8 and QID15 for missing values. this was necessary because only one record with finished = 0 and progress = 0 existed. And other participants who did not submit the form would have identified themselves, especially when progress was low based on some combination of variables, especially income and age.

Unfinished Forms
id_number StartDate Finished Progress Duration EndDate RecordedDate LocationLatitude LocationLongitude QID29 QID6 QID7 QID8 QID15 QID22 QID21 QID19 QID12 QID10 QID14 QID16 QID17_1 QID17_2 QID17_3 QID17_4 QID17_5 QID18_1 QID18_2 QID18_3 QID18_4 QID18_5 QID20 QID23 QID24_1 QID24_2 QID24_3 QID24_4 QID24_5 QID24_6 QID25_1 QID25_2 QID25_3 QID25_4 QID25_5 QID25_6 QID26 QID27
369 2021-01-10 0 0 NA NA NA -37.65061 144.2748 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Unfinished Forms
id_number StartDate Finished Progress Duration EndDate RecordedDate LocationLatitude LocationLongitude QID29 QID6 QID7 QID8 QID15 QID22 QID21 QID19 QID12 QID10 QID14 QID16 QID17_1 QID17_2 QID17_3 QID17_4 QID17_5 QID18_1 QID18_2 QID18_3 QID18_4 QID18_5 QID20 QID23 QID24_1 QID24_2 QID24_3 QID24_4 QID24_5 QID24_6 QID25_1 QID25_2 QID25_3 QID25_4 QID25_5 QID25_6 QID26 QID27
77 2021-01-16 0 27.5 NA NA NA -37.46126 145.5511 1 53 3 0 3185 95065.54 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
208 2021-01-04 0 5.5 NA NA NA -37.70710 144.3773 1 40 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
369 2021-01-10 0 0.0 NA NA NA -37.65061 144.2748 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
396 2021-01-28 0 22.0 NA NA NA -37.91028 145.1299 1 46 3 1 3116 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
616 2021-01-11 0 16.5 NA NA NA -37.86295 146.8587 1 47 1 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
634 2021-01-25 0 22.0 NA NA NA -37.81201 145.5292 1 57 1 0 3020 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
985 2021-01-09 0 27.5 NA NA NA -37.52656 147.0865 1 53 2 1 3036 104209.62 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

Following steps were taken to de-identify the unfinished form records.

#following steps were taken to de-identify the unfinished form records 

#imputing missing demographics for unfilled forms
survey2 <- survey1 %>%
  mutate(QID6=case_when(Progress<10  ~ median(QID6, na.rm = TRUE),
                        TRUE ~ QID6),
         QID7=case_when(Progress<10 ~ median(QID7, na.rm = TRUE),
                        TRUE ~ QID7),
         QID8=case_when(Progress<10 ~ median(QID8, na.rm = TRUE),
                        TRUE ~ QID8),
         QID15=case_when(Progress<10 ~ median(QID15, na.rm = TRUE),
                        TRUE ~ QID15)
         )

There are still 17 identifiable records, however missing demographics have been filled, therefore in part these records may not be identifiable. However on a closer look these records are indeed identifiable based on the age, start date and other non demographic variables.

Unfinished Forms
id_number StartDate Finished Progress Duration EndDate RecordedDate LocationLatitude LocationLongitude QID29 QID6 QID7 QID8 QID15 QID22 QID21 QID19 QID12 QID10 QID14 QID16 QID17_1 QID17_2 QID17_3 QID17_4 QID17_5 QID18_1 QID18_2 QID18_3 QID18_4 QID18_5 QID20 QID23 QID24_1 QID24_2 QID24_3 QID24_4 QID24_5 QID24_6 QID25_1 QID25_2 QID25_3 QID25_4 QID25_5 QID25_6 QID26 QID27 Freq
163 2021-01-11 0 66.0 NA NA NA -38.07033 145.5446 1 38 2 3 3135 97347.76 113420.21 125811.61 2 6 4 6 1 0 0 0 0 1 0 0 0 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1
208 2021-01-04 0 5.5 NA NA NA -37.70710 144.3773 1 51 2 1 3106 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1
312 2021-01-07 0 88.0 NA NA NA -37.52825 145.6797 1 41 2 0 3086 146447.00 0.00 0.00 2 0 6 0 0 0 0 0 0 NA NA NA NA NA 1 NA 0 0 1 0 0 0 0 0 0 1 1 0 NA NA 1
365 2021-01-02 0 60.5 NA NA NA -37.86608 144.2485 1 45 3 3 3154 78272.93 107102.10 108716.14 2 6 6 1 1 0 0 0 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1
369 2021-01-10 0 0.0 NA NA NA -37.65061 144.2748 1 51 2 1 3106 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1
396 2021-01-28 0 22.0 NA NA NA -37.91028 145.1299 1 46 3 1 3116 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1
397 2021-01-30 0 88.0 NA NA NA -38.35605 145.5131 1 67 1 2 3130 139224.43 153360.51 160608.00 4 3 6 1 1 0 0 0 0 1 0 1 0 1 1 3 1 0 0 0 0 1 1 0 0 1 0 1 NA NA 1
417 2021-01-20 0 77.0 NA NA NA -38.28062 146.6651 1 58 1 0 3070 119540.94 142290.12 137121.28 3 6 5 2 1 0 1 0 0 1 0 0 1 1 2 3 NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1
447 2021-01-18 0 93.5 NA NA NA -38.36066 144.8651 1 68 4 0 3060 172802.08 0.00 186303.44 5 0 5 0 1 0 0 0 0 NA NA NA NA NA 1 NA 1 1 1 0 0 0 1 0 0 1 1 1 3 NA 1
476 2021-01-05 0 49.5 NA NA NA -37.59215 144.2300 1 63 2 1 3139 82773.11 0.00 107823.81 2 0 4 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1
616 2021-01-11 0 16.5 NA NA NA -37.86295 146.8587 1 47 1 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1
634 2021-01-25 0 22.0 NA NA NA -37.81201 145.5292 1 57 1 0 3020 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1
762 2021-01-13 0 88.0 NA NA NA -38.34920 144.5482 1 37 2 0 3206 54374.09 65419.80 73155.12 2 6 5 6 1 0 0 0 0 1 0 0 1 1 2 4 1 0 0 0 1 0 0 0 0 1 0 0 NA NA 1
772 2021-01-30 0 33.0 NA NA NA -38.09860 144.7627 1 64 1 1 3191 125814.33 158441.66 139066.22 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1
821 2021-01-30 0 88.0 NA NA NA -37.75127 145.3153 1 55 3 2 3028 58931.06 72406.46 70574.59 6 6 6 6 1 0 0 1 1 0 1 1 0 0 2 1 1 0 1 0 0 1 0 1 0 0 0 0 NA NA 1
963 2021-01-22 0 33.0 NA NA NA -37.74646 143.8936 1 48 1 2 3036 129949.86 167894.59 145883.32 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1
996 2021-01-10 0 55.0 NA NA NA -37.16393 145.6576 1 33 4 1 3012 128311.90 139351.79 150497.22 2 6 6 3 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1

Following the above observation this step was necessary and start dates were swapped only for unfinished forms as the end date is empty for unfinished forms and further progress was perturbed to de-identify records based on start date, finish status and progress.

#folowing the above observation this step was necessary and start dates were swapped only for unfinished forms as the end date is empty for unfinished forms and further progress was perturbed to deidentify records based on start date, finish status and progress.
# de-identifying unfinished records
set.seed(109)
survey6 <- survey5 %>%
  mutate(StartDate = case_when(Progress!=100 ~ sample(StartDate,n(),replace = TRUE),
                               TRUE ~ StartDate),
         Progress = case_when(Progress!=100 ~ round(Progress + rnorm(n(),0,6.2),1),
                              TRUE ~ Progress))

It was seen that there are unique records when age is combined with other variables. Hence age needed to be genralized to solve this identification risk.

ageadult %>%
  filter(n==1) %>%
    head() %>%
  knitr::kable(caption="Age and Adults",booktabs = TRUE) %>% 
  kable_styling(bootstrap_options = c("striped", "hover"), latex_options = "hold_position")
Age and Adults
QID6 QID7 n
22 1 1
22 3 1
23 2 1
24 3 1
25 1 1
25 2 1
agechildren %>%
    filter(n==1) %>%
    head() %>%
  knitr::kable(caption="Age and Children",booktabs = TRUE) %>% 
  kable_styling(bootstrap_options = c("striped", "hover"), latex_options = "hold_position")
Age and Children
QID6 QID8 n
23 1 1
24 1 1
25 0 1
25 1 1
25 2 1
26 1 1
agepost %>%
    filter(n==1) %>%
    head() %>%
  knitr::kable(caption="Age and Postal Code",booktabs = TRUE) %>% 
  kable_styling(bootstrap_options = c("striped", "hover"), latex_options = "hold_position")
Age and Postal Code
QID6 QID15 n
46 3000 1
68 3000 1
39 3001 1
53 3001 1
59 3001 1
64 3001 1

Records are identifiable where income is very high or very low and the combination of age and income. For example a young participant with high income is identifiable, participant abouve 80 are identifiable and so is the person with highest and lowest income. Also people tend to remeber their incomes.

survey0 %>%
  ggplot() +
  geom_point(aes(x=QID6, y=QID22)) + #TOP BOTTOM CODING SOLVES >200000 POINTS ARE HANDLED, FOR AGE 88
  ggtitle("Age vs Income2019")

survey0 %>%
  ggplot() +
  geom_point(aes(x=QID6, y=QID21)) +
  ggtitle("Age vs Income2020")

survey0 %>%
  ggplot() +
  geom_point(aes(x=QID6, y=QID19)) +
    ggtitle("Age vs Income2021")

This problem was solved by a series of generalization, perturbation and top and bottom coding.

#this problem is solved by a series of generalization, perturbation and top and bottom coding.
survey3 <- survey2 %>%
  mutate(age_group = as.character(cut(QID6,breaks = c(21,40,60,88)))) %>%  #generalization
  mutate(income = QID22 + rnorm(n(),0,5000)) %>% #perturbation
  mutate(income2019 = ifelse(income>200000, 220000, income)) %>% #top coding
  mutate(income2019 = case_when(income2019<30000 ~ 30000,
                                      TRUE ~ income2019)) %>% #bottom coding
   mutate(income2019 = case_when(
     income2019==30000 ~ income2019 + rnorm(n(),0,2000),
     income2019==220000 ~ income2019 + rnorm(n(),0,10000),
       TRUE ~ income2019                               
     )) %>% #perturbation
  mutate(income2020 = case_when(
     QID21<30000 & QID21!=0 ~ 30000 + rnorm(n(),0,2000),
     QID21>200000 ~ 220000 + rnorm(n(),0,10000),
     QID21!=0 ~ QID21 + rnorm(n(),0,500),
     TRUE ~ QID21
     )) %>% #perturbation
  mutate(income2021 = case_when(
     QID19<30000 & QID19!=0 ~ 30000 + rnorm(n(),0,2000),
     QID19>200000 ~ 220000 + rnorm(n(),0,10000),
     QID19!=0 ~ QID19 + rnorm(n(),0,500),
     TRUE ~ QID19
     ))

Now, records cannot be identified based on age.

#records cannot be identified based on age.
survey3 %>%
  filter(Progress!=100) %>%
  group_by(age_group) %>%
  mutate(Freq = n()) %>%
  ungroup() %>%
  filter(Freq == 1) %>%
    head() 
## # A tibble: 0 x 53
## # … with 53 variables: id_number <int>, StartDate <date>, Finished <int>,
## #   Progress <dbl>, Duration <dbl>, EndDate <date>, RecordedDate <date>,
## #   LocationLatitude <dbl>, LocationLongitude <dbl>, QID29 <dbl>, QID6 <dbl>,
## #   QID7 <dbl>, QID8 <dbl>, QID15 <int>, QID22 <dbl>, QID21 <dbl>, QID19 <dbl>,
## #   QID12 <int>, QID10 <int>, QID14 <int>, QID16 <dbl>, QID17_1 <int>,
## #   QID17_2 <int>, QID17_3 <int>, QID17_4 <int>, QID17_5 <int>, QID18_1 <int>,
## #   QID18_2 <int>, QID18_3 <int>, QID18_4 <int>, QID18_5 <int>, QID20 <int>,
## #   QID23 <int>, QID24_1 <int>, QID24_2 <int>, QID24_3 <int>, QID24_4 <int>,
## #   QID24_5 <int>, QID24_6 <int>, QID25_1 <int>, QID25_2 <int>, QID25_3 <int>,
## #   QID25_4 <int>, QID25_5 <int>, QID25_6 <int>, QID26 <int>, QID27 <int>,
## #   age_group <chr>, income <dbl>, income2019 <dbl>, income2020 <dbl>,
## #   income2021 <dbl>, Freq <int>

These are the records with this combination that occur only once in the data. however, income is perturbed and top and bottom coded and hence it is difficult for a specific person to identify himself. A participant cannot identify himself/herself accurately and can only identify with a low probability that the guess is correct since the start dates have been swapped and progress is also perturbed.

survey3 %>%
  filter(Progress!=100) %>%
  group_by(age_group,income2019,income2020,income2021) %>%
  mutate(Freq = n()) %>%
  ungroup() %>%
  filter(Freq == 1) %>%
  head() %>%
  knitr::kable(caption="age_group,income2019,income2020,income2021",booktabs = TRUE) %>% 
  kable_styling(bootstrap_options = c("striped", "hover"), latex_options = "hold_position")
age_group,income2019,income2020,income2021
id_number StartDate Finished Progress Duration EndDate RecordedDate LocationLatitude LocationLongitude QID29 QID6 QID7 QID8 QID15 QID22 QID21 QID19 QID12 QID10 QID14 QID16 QID17_1 QID17_2 QID17_3 QID17_4 QID17_5 QID18_1 QID18_2 QID18_3 QID18_4 QID18_5 QID20 QID23 QID24_1 QID24_2 QID24_3 QID24_4 QID24_5 QID24_6 QID25_1 QID25_2 QID25_3 QID25_4 QID25_5 QID25_6 QID26 QID27 age_group income income2019 income2020 income2021 Freq
77 2021-01-16 0 27.5 NA NA NA -37.46126 145.5511 1 53 3 0 3185 95065.54 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA (40,60] 92133.53 92133.53 NA NA 1
163 2021-01-11 0 66.0 NA NA NA -38.07033 145.5446 1 38 2 3 3135 97347.76 113420.21 125811.61 2 6 4 6 1 0 0 0 0 1 0 0 0 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA (21,40] 96143.68 96143.68 113359.16 125531.71 1
204 2021-01-01 0 44.0 NA NA NA -38.14855 146.5748 1 52 1 1 3141 95782.19 113725.38 104729.83 2 2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA (40,60] 96076.48 96076.48 112669.94 104643.12 1
312 2021-01-07 0 88.0 NA NA NA -37.52825 145.6797 1 41 2 0 3086 146447.00 0.00 0.00 2 0 6 0 0 0 0 0 0 NA NA NA NA NA 1 NA 0 0 1 0 0 0 0 0 0 1 1 0 NA NA (40,60] 151564.74 151564.74 0.00 0.00 1
365 2021-01-02 0 60.5 NA NA NA -37.86608 144.2485 1 45 3 3 3154 78272.93 107102.10 108716.14 2 6 6 1 1 0 0 0 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA (40,60] 73588.17 73588.17 106749.46 107878.94 1
381 2021-01-23 0 44.0 NA NA NA -38.02511 145.6101 1 52 3 1 3140 54781.61 79984.16 78989.82 3 6 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA (40,60] 58250.21 58250.21 79380.18 79899.97 1

There is uncertainty with respect to the income and age due to perturbation and top and bottom coding of income. There are empty values in income2019 nor are there zeroes. However, there is emptiness and zeroes in income2020 and income 2021. Incomes which were zero/empty are not modified as these incomes are an indicator of financial stability and represents the impact of covid 19 and the pandemic.

Unusual growth in income not seen explicitly and if the income goes down to zero, there are multiple records with the same trend and with the incomes being coded and perturbed, a participant can only identify himself with a low probability that the guess is correct.

incomemax %>%
  head() %>%
  knitr::kable(caption="High income records",booktabs = TRUE) %>% 
  kable_styling(bootstrap_options = c("striped", "hover"), latex_options = "hold_position")
High income records
id_number age_group income2019 income2020 income2021
4 (40,60] 229863.9 216441.8 210656.6
22 (40,60] 206521.5 240443.1 218106.2
34 (21,40] 214105.7 225504.5 222609.2
49 (40,60] 204520.9 213092.3 223316.6
52 (40,60] 224439.8 221099.5 228261.7
86 (21,40] 218105.2 225305.4 224509.0
incomemin %>%
  head() %>%
  knitr::kable(caption="Low income records",booktabs = TRUE) %>% 
  kable_styling(bootstrap_options = c("striped", "hover"), latex_options = "hold_position")
Low income records
id_number age_group income2019 income2020 income2021
20 (40,60] 32878.46 67221.82 54796.84
28 (40,60] 31594.98 41439.08 38472.45
81 (21,40] 36647.17 43265.86 63281.51
115 (40,60] 30362.47 0.00 0.00
313 (40,60] 47474.36 64344.64 56217.08
327 (40,60] 33540.86 50264.62 46129.92
survey3 %>%
  ggplot() +
  geom_point(aes(x=QID6, y=income2019))+
    ggtitle("Age vs Income2019")

survey3 %>%
  ggplot() +
  geom_point(aes(x=QID6, y=income2020))+
    ggtitle("Age vs Income2020")

survey3 %>%
  ggplot() +
  geom_point(aes(x=QID6, y=income2021))+
    ggtitle("Age vs Income2021")

The analysis below suggests that records cannot be identified based on the changes in incomes from 2019 to 2021.

income000 <- survey0 %>% filter(QID22 == 0 & QID21 == 0 & QID19 == 0)
income001 <- survey0 %>% filter(QID22 == 0 & QID21 == 0) # no 0 in 2019

#29 participants with both income2020 and income2021=0, however there are multiple such records for comparative incomes.
income100 <- survey0 %>% filter(QID19 == 0 & QID21 == 0) 

income010 <- survey0 %>% filter(QID19 == 0 & QID22 == 0) # no records

#100 participants with both income2021=0, however there are multiple such records for comparative incomes. 312 only one with finish = 0 and income20&21=0
income110 <- survey0 %>% filter(QID19 == 0) 


#261 participants with both income2020=0, however there are multiple such records for comparative incomes.
income101 <- survey0 %>% filter(QID21 == 0)

#Hence records cannot be identified based on the change in incomes from 2019 to 2021.

It can be seen that many there are unique postal codes to many of the participants and postal code is not important as much as the longitude and latitude coordinates, Therefore the postal code have been swapped to de-identify participants based on postal codes.

Postal Codes Count
QID15 n
3027 1
3076 1
3088 1
3181 1
3188 1
3190 1
Postal Codes and Age
age_group QID15 n
(21,40] 3001 1
(21,40] 3007 1
(21,40] 3010 1
(21,40] 3011 1
# postcode is not important for the analysis. 
#many ones. best option swapping
survey5 <- survey4 %>%
    group_by(LocationLongitude, LocationLatitude) %>%
mutate(postal_code = sample(QID15,n(),replace = TRUE)) %>%
  ungroup()

The longitude and latitude coordinates privided in the raw_data are precise and thus participants are indentifiable with respect to these coordinates. However, these coordinates are very much essential to analyze the data spatially, hence these coordinates are rounded to their integer values and thus represent data records for 110 km radius.

#generalization
survey4 <- survey3 %>%
  mutate(LocationLongitude = round(LocationLongitude, 0),
    LocationLatitude=round(LocationLatitude,0))

After applying all the de-identification strategies step by step there are however two records that are identifiable. The only participant in age group (60,88] with 5 adults in the household and the only participant in age group (21,40] with 4 children in the household. However, it would be very difficult to identify these records as the postal codes were swapped, incomes were perturbed and top and bottom coded and as such only a probability will be associated with the identification of these records. I did want to swap these variables for the utility of the data would be compromised owing to the fact that the home life and mental health is actually affected by the children and adults in any household. Swapping closely will however deiidentify these 2 records. “adultchildrenswap” contains the fully de-identified data.

survey7 %>%
  group_by(age_group,QID7) %>%
  mutate(Freq = n()) %>%
  ungroup() %>%
  filter(Freq == 1) %>%
  knitr::kable(caption="age_group, adults",booktabs = TRUE) %>% 
  kable_styling(bootstrap_options = c("striped", "hover"), latex_options = "hold_position")
age_group, adults
id_number StartDate Finished Progress Duration EndDate RecordedDate LocationLatitude LocationLongitude QID29 QID7 QID8 QID12 QID10 QID14 QID16 QID17_1 QID17_2 QID17_3 QID17_4 QID17_5 QID18_1 QID18_2 QID18_3 QID18_4 QID18_5 QID20 QID23 QID24_1 QID24_2 QID24_3 QID24_4 QID24_5 QID24_6 QID25_1 QID25_2 QID25_3 QID25_4 QID25_5 QID25_6 QID26 QID27 age_group income2019 income2020 income2021 postal_code Freq
504 2021-01-21 1 100 2701.299 2021-01-21 2021-01-21 -37 144 1 5 1 2 4 6 0 1 0 0 0 1 1 1 0 1 0 1 4 0 0 1 0 0 0 1 0 0 1 0 0 1 2 (60,88] 80565.26 102191.6 0 3050 1
survey7 %>%
  group_by(age_group,QID8) %>%
  mutate(Freq = n()) %>%
  ungroup() %>%
  filter(Freq == 1) %>%
  knitr::kable(caption="age_group, children",booktabs = TRUE) %>% 
  kable_styling(bootstrap_options = c("striped", "hover"), latex_options = "hold_position")
age_group, children
id_number StartDate Finished Progress Duration EndDate RecordedDate LocationLatitude LocationLongitude QID29 QID7 QID8 QID12 QID10 QID14 QID16 QID17_1 QID17_2 QID17_3 QID17_4 QID17_5 QID18_1 QID18_2 QID18_3 QID18_4 QID18_5 QID20 QID23 QID24_1 QID24_2 QID24_3 QID24_4 QID24_5 QID24_6 QID25_1 QID25_2 QID25_3 QID25_4 QID25_5 QID25_6 QID26 QID27 age_group income2019 income2020 income2021 postal_code Freq
656 2021-01-14 1 100 7072.893 2021-01-15 2021-01-15 -38 144 1 3 4 2 6 4 2 1 0 0 0 1 1 0 1 1 0 3 3 0 0 1 0 0 0 1 0 0 1 0 0 3 3 (21,40] 73937.17 90419.68 101427.4 3162 1
adultchildrenswap <-survey7 %>%
  group_by(age_group) %>%
  mutate(QID7 = sample(QID7,n(),replace = TRUE)) %>%
  mutate(QID8 = sample(QID8,n(),replace = TRUE)) %>%
  ungroup()

Computer readable structure

An important understanding of this published data is that it conforms to the tidy data definition: each variable is in its own column, each observation is in its own row and each value is in its own cell. Owing to this norm, multiple records exist for the same participant. Therefore for each participant the income occurs for 3 years, 2019, 2020 and 2021; the combination of “workfromhome,” “hrperweek,”
“wrkschedulestability” and “mentalhealth” occur twice, each for the year 2019 and 2020; “wrktimechoice” occurs ten times(five for each year, 2019 and 2020); and “homelife” occurs twelve times(six for each year, 2019 and 2020). Therefore there are 3x5x6, 90 records for each participant and redundancy must be considered accordingly. Additionally, “wrktimechoiceeoption” and “homelife” is provided to refer if that option was selected by the participant for “wrktimechoice”and “homelife” respectively. If an option was selected then the option number is the value otherwise 0. An important understanding here while filtering records here is to understand that there are participants who have not responded to any of the options and therefore for that particular year, all of the options( [1,5] for “wrktimechoice” and [1,6] for “homelife”) will be 0 which is equivalent to NA, that is, this question was not answered by the participant. On the other hand, a participant had answered the question if any one of the option is not 0.

All records are true with respect to the original survey data apart from the few perturbations and swapping in the demographics section that was done to de-identify the data. No new or faulty records exist in the data. This can be verified by checking with respect to the id_number which is same in both datasets. Ideally, the id_number also had to be reordered, it however present to validate the correctness of the published data.

survey8 <- survey7 %>% #pivoting income in 3 years
  rename(`2019` = income2019,
         `2020` = income2020,
         `2021` = income2021) %>%
  pivot_longer(cols = c(`2019`, `2020`, `2021`),
               names_to = "year",
               values_to = "income") %>% #12,10
  rename(`2019` = QID12,                   
         `2020` = QID10,) %>%
  pivot_longer(cols = c(`2019`, `2020`),
               names_to = "workfromhome",
               values_to = "option") %>%
  mutate(workfromhome=case_when(year==2021 ~ year,
         TRUE ~ workfromhome)) %>%
  filter(year == workfromhome) %>%
  mutate(workfromhome=as.character(option)) %>%
  mutate(workfromhome=case_when(year==2021 ~ "NA",
                                TRUE ~ workfromhome)) %>% 
  replace_with_na_at(.vars = c("workfromhome"),
                     condition = ~.x == "NA") %>%
  select(-option) %>%
  distinct() %>%                      #14,16
  rename(`2019` = QID14,
         `2020` = QID16,) %>%
  pivot_longer(cols = c(`2019`, `2020`),
               names_to = "hrperweek",
               values_to = "option") %>%
  mutate(hrperweek=case_when(year==2021 ~ year,
         TRUE ~ hrperweek)) %>%
  filter(year == hrperweek) %>%
  mutate(hrperweek=as.character(option)) %>%
  mutate(hrperweek=case_when(year==2021 ~ "NA",
                                TRUE ~ hrperweek)) %>% 
  replace_with_na_at(.vars = c("hrperweek"),
                     condition = ~.x == "NA") %>%
  select(-option) %>%
  distinct() %>%                      #20,23
  rename(`2019` = QID20,
         `2020` = QID23,) %>%
  pivot_longer(cols = c(`2019`, `2020`),
               names_to = "wrkschedulestability",
               values_to = "option") %>%
  mutate(wrkschedulestability=case_when(year==2021 ~ year,
         TRUE ~ wrkschedulestability)) %>%
  filter(year == wrkschedulestability) %>%
  mutate(wrkschedulestability=as.character(option)) %>%
  mutate(wrkschedulestability=case_when(year==2021 ~ "NA",
                                TRUE ~ wrkschedulestability)) %>% 
  replace_with_na_at(.vars = c("wrkschedulestability"),
                     condition = ~.x == "NA") %>%
  select(-option) %>%
  distinct() %>%                      #26,27
  rename(`2019` = QID26,
         `2020` = QID27,) %>%
  pivot_longer(cols = c(`2019`, `2020`),
               names_to = "mentalhealth",
               values_to = "option") %>%
  mutate(mentalhealth=case_when(year==2021 ~ year,
         TRUE ~ mentalhealth)) %>%
  filter(year == mentalhealth) %>%
  mutate(mentalhealth=as.character(option)) %>%
  mutate(mentalhealth=case_when(year==2021 ~ "NA",
                                TRUE ~ mentalhealth)) %>% 
  replace_with_na_at(.vars = c("mentalhealth"),
                     condition = ~.x == "NA") %>%
  select(-option) %>%
  distinct()
 

#3000 records created. 
#takes time to execute
survey9 <- survey8 %>%  #17,18
  rename(`2019_1`= QID17_1,
         `2019_2`= QID17_2,
         `2019_3`= QID17_3,
         `2019_4`= QID17_4,
         `2019_5`= QID17_5,
         `2020_1`= QID18_1,
         `2020_2`= QID18_2,
         `2020_3`= QID18_3,
         `2020_4`= QID18_4,
         `2020_5`= QID18_5
         ) %>%
  pivot_longer(cols = c(starts_with("20")),
               names_to = "wrktimechoice",
               values_to = "option") %>%
  separate(col = wrktimechoice,
           into = c("tempyear","wrktimechoice"),
           sep = "_") %>%
  mutate(tempyear=case_when(year==2021 & tempyear==2019 ~ year,
         TRUE ~ tempyear),
         option=as.character(option)) %>%
  mutate(wrktimechoice=case_when(year==2021 ~ "NA",
                                TRUE ~ wrktimechoice),
         option=case_when(year==2021 ~ "NA",
                                TRUE ~ option)) %>% 
  replace_with_na_at(.vars = c("wrktimechoice","option"),
                     condition = ~.x == "NA") %>%
  mutate(option=as.integer(option)) %>%
  mutate(wrktimechoice=case_when(option==0 ~ "0",
                                 TRUE ~ wrktimechoice)) %>%
  filter(year == tempyear) %>%
  mutate(wrktimechoice=case_when(is.na(option) ~ "NA",
                                 TRUE ~ wrktimechoice)) %>%
  replace_with_na_at(.vars = c("wrktimechoice"),
                     condition = ~.x == "NA") %>% 
  mutate(wrktimechoice=as.character(wrktimechoice)) %>%
  arrange(id_number) %>%
  select(-c(tempyear,option))

survey10 <- survey9%>%
  mutate(wrktimechoiceoption=as.character(rep(1:5, 3000)))

#15000 records created
#takes time to execute
survey11 <- survey10 %>%  #17,18
  rename(`2019_1`=QID24_1,
         `2019_2`=QID24_2,
         `2019_3`=QID24_3,
         `2019_4`=QID24_4,
         `2019_5`=QID24_5,
         `2019_6`=QID24_6,
         `2020_1`=QID25_1,
         `2020_2`=QID25_2,
         `2020_3`=QID25_3,
         `2020_4`=QID25_4,
         `2020_5`=QID25_5,
         `2020_6`=QID25_6
         ) %>%
  pivot_longer(cols = c(starts_with("20")),
               names_to = "homelife",
               values_to = "option") %>%
  separate(col = homelife,
           into = c("tempyear","homelife"),
           sep = "_") %>%
  mutate(tempyear=case_when(year==2021 & tempyear==2019 ~ year,
         TRUE ~ tempyear),
         option=as.character(option)) %>%
  mutate(homelife=case_when(year==2021 ~ "NA",
                                TRUE ~ homelife),
         option=case_when(year==2021 ~ "NA",
                                TRUE ~ option)) %>% 
  replace_with_na_at(.vars = c("homelife","option"),
                     condition = ~.x == "NA") %>%
  mutate(option=as.integer(option)) %>%
  mutate(homelife=case_when(option==0 ~ "0",
                                 TRUE ~ homelife)) %>%
  filter(year == tempyear) %>%
  mutate(homelife=case_when(is.na(option) ~ "NA",
                                 TRUE ~ homelife)) %>%
  replace_with_na_at(.vars = c("homelife"),
                     condition = ~.x == "NA") %>% 
  mutate(homelife=as.character(homelife)) %>%
  arrange(id_number) %>%
  select(-c(tempyear,option))

#90000 records created
#takes time to execute
survey12 <- survey11 %>%
  mutate(homelife=case_when(is.na(wrktimechoice) ~ "NA",
                                 TRUE ~ homelife)) %>%
   replace_with_na_at(.vars = c("homelife"),
                     condition = ~.x == "NA") %>% 
  mutate(homelife=case_when(id_number %in% c(163,365,417) ~ "NA",
                                 TRUE ~ homelife)) %>%
   replace_with_na_at(.vars = c("homelife"),
                     condition = ~.x == "NA") 
survey13 <- survey12%>%
  mutate(homelifeoption=as.character(rep(1:6, 15000)),
         wrktimechoiceoption=as.character(wrktimechoiceoption))
survey14 <- survey13 %>% 
  rename(informed_consent=QID29,
         adultcount=QID7,
         childcount=QID8)

colorder <- c(1:10,13,11,12,14,15:18,21,22,19,23,24,20)
survey15 <- survey14[,colorder]

Recoding factor levels

survey16 <- survey15 %>%
  mutate(workfromhome = fct_recode(workfromhome, 
                                   "Didn’t work in this year"="0",
                                   "Didn’t work in this year"="1",
                                   Never = "2",
                                   Sometimes = "3",
                                   "About half the time" = "4",
                                   "Most of the time" = "5",
                                   Always = "6"
                                   ),
         hrperweek = fct_recode(hrperweek, 
                                   "Didn’t work in this year"="0",
                                   "Didn’t work in this year"="1",
                                   "less than 10 hours" = "2",
                                   "10-20 hours" = "3",
                                   "20-30 hours" = "4",
                                   "30-40 hours" = "5",
                                   "40+ hours" = "6"
                                   ),
         wrktimechoice = fct_recode(wrktimechoice, 
                                   "Traditional work hours (9-5, Mon - Fri)"="1",
                                   "Early mornings (5am - 9am)" = "2",
                                   "Early evenings (5pm - 9pm)" = "3",
                                   "Late evenings (5pm - midnight)" = "4",
                                   "Overnight (midnight - 5am)" = "5",
                                   ),
         wrktimechoiceoption = fct_recode(wrktimechoiceoption, 
                                   "Traditional work hours (9-5, Mon - Fri)"="1",
                                   "Early mornings (5am - 9am)" = "2",
                                   "Early evenings (5pm - 9pm)" = "3",
                                   "Late evenings (5pm - midnight)" = "4",
                                   "Overnight (midnight - 5am)" = "5",
                                   ),
         wrkschedulestability = fct_recode(wrkschedulestability, 
                                   "No changes"="1",
                                   "Some small changes" = "2",
                                   "Varying" = "3",
                                   "Unpredictable" = "4",
                                   ),
         homelife = fct_recode(homelife, 
                                   "Comfortable"="1",
                                   "Lonely" = "2",
                                   "Active" = "3",
                                   "Connected" = "4",
                                   "Peaceful" = "5",
                                   "Chaotic" = "6"
                                   ),
         homelifeoption = fct_recode(homelifeoption, 
                                   "Comfortable"="1",
                                   "Lonely" = "2",
                                   "Active" = "3",
                                   "Connected" = "4",
                                   "Peaceful" = "5",
                                   "Chaotic" = "6"
                                   ),
         mentalhealth = fct_recode(mentalhealth, 
                                   "Good"="1",
                                   "Some challenges" = "2",
                                   "Significant challenges" = "3"
                                   )
         )

Save data in a csv form in the data folder

write.csv(survey16, here::here("data/release-data-Mohammed-Faizan.csv"))
datadict <- read_excel(here::here("raw_data/datadictionary_etc5512_assignment3.xlsx"), sheet=1)
codebook <- read_excel(here::here("raw_data/datadictionary_etc5512_assignment3.xlsx"), sheet=2)

write.csv(datadict, here::here("data/data-dictionary-Mohammed-Faizan.csv"))
write.csv(codebook, here::here("data/codebook-Mohammed-Faizan.csv"))

Resources

Survey Data Source

  • “Kennedy L.A., (2021). Simulated Data Survey”

R packages

R Core Team (2021) Xie (2021a) Dietrich (2020) Wickham et al. (2021), Wickham (2021a), Wickham and Bryan (2019) Robinson and Silge (2021) Wickham and Miller (2021) Hester (2020) Fabri (2020) Wickham et al. (2020), Zhu (2021), Xie (2021b), Tierney et al. (2020), Henry and Wickham (2020), Wickham and Hester (2020), Wickham (2019), Müller and Wickham (2021), Wickham (2021b), Wickham (2021c), Tierney (2019), Xie (2016), Wickham (2016), Xie (2015), Xie (2014), Wickham et al. (2019), Tierney (2017) Wickham (2021a)

Dietrich, Jan Philipp. 2020. Citation: Software Citation Tools. https://CRAN.R-project.org/package=citation.
Fabri, Antoine. 2020. Unglue: Extract Matched Substrings Using a Pattern. https://CRAN.R-project.org/package=unglue.
Henry, Lionel, and Hadley Wickham. 2020. Purrr: Functional Programming Tools. https://CRAN.R-project.org/package=purrr.
Hester, Jim. 2020. Glue: Interpreted String Literals. https://CRAN.R-project.org/package=glue.
Müller, Kirill, and Hadley Wickham. 2021. Tibble: Simple Data Frames. https://CRAN.R-project.org/package=tibble.
R Core Team. 2021. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Robinson, David, and Julia Silge. 2021. Tidytext: Text Mining Using Dplyr, Ggplot2, and Other Tidy Tools. https://github.com/juliasilge/tidytext.
Tierney, Nicholas. 2017. “Visdat: Visualising Whole Data Frames.” JOSS 2 (16): 355. https://doi.org/10.21105/joss.00355.
———. 2019. Visdat: Preliminary Visualisation of Data. https://CRAN.R-project.org/package=visdat.
Tierney, Nicholas, Di Cook, Miles McBain, and Colin Fay. 2020. Naniar: Data Structures, Summaries, and Visualisations for Missing Data. https://github.com/njtierney/naniar.
Wickham, Hadley. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.
———. 2019. Stringr: Simple, Consistent Wrappers for Common String Operations. https://CRAN.R-project.org/package=stringr.
———. 2021a. Forcats: Tools for Working with Categorical Variables (Factors). https://CRAN.R-project.org/package=forcats.
———. 2021b. Tidyr: Tidy Messy Data. https://CRAN.R-project.org/package=tidyr.
———. 2021c. Tidyverse: Easily Install and Load the Tidyverse. https://CRAN.R-project.org/package=tidyverse.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.
Wickham, Hadley, and Jennifer Bryan. 2019. Readxl: Read Excel Files. https://CRAN.R-project.org/package=readxl.
Wickham, Hadley, Winston Chang, Lionel Henry, Thomas Lin Pedersen, Kohske Takahashi, Claus Wilke, Kara Woo, Hiroaki Yutani, and Dewey Dunnington. 2020. Ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. https://CRAN.R-project.org/package=ggplot2.
Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2021. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.
Wickham, Hadley, and Jim Hester. 2020. Readr: Read Rectangular Text Data. https://CRAN.R-project.org/package=readr.
Wickham, Hadley, and Evan Miller. 2021. Haven: Import and Export SPSS, Stata and SAS Files. https://CRAN.R-project.org/package=haven.
Xie, Yihui. 2014. “Knitr: A Comprehensive Tool for Reproducible Research in R.” In Implementing Reproducible Computational Research, edited by Victoria Stodden, Friedrich Leisch, and Roger D. Peng. Chapman; Hall/CRC. http://www.crcpress.com/product/isbn/9781466561595.
———. 2015. Dynamic Documents with R and Knitr. 2nd ed. Boca Raton, Florida: Chapman; Hall/CRC. https://yihui.org/knitr/.
———. 2016. Bookdown: Authoring Books and Technical Documents with R Markdown. Boca Raton, Florida: Chapman; Hall/CRC. https://bookdown.org/yihui/bookdown.
———. 2021a. Bookdown: Authoring Books and Technical Documents with r Markdown. https://CRAN.R-project.org/package=bookdown.
———. 2021b. Knitr: A General-Purpose Package for Dynamic Report Generation in r. https://yihui.org/knitr/.
Zhu, Hao. 2021. kableExtra: Construct Complex Table with Kable and Pipe Syntax. https://CRAN.R-project.org/package=kableExtra.